Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Authors

Abstract

Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs video representations at frame-clip-video granularities to capture fine-grained video content, and text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between the video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
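The hierarchical contrastive objective described in the abstract can be pictured as a weighted sum of symmetric InfoNCE losses computed at each granularity (video-sentence, clip-phrase, frame-word). The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function names, level weights, and temperature value are assumptions for illustration.

```python
import numpy as np

def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE loss over a (batch x batch) similarity matrix.

    Diagonal entries are the matched video-text pairs; off-diagonal
    entries serve as in-batch negatives. Temperature is an assumed value.
    """
    logits = sim / temperature

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))        # cross-entropy on matched pairs

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def hierarchical_loss(video_feats, text_feats, weights=(1.0, 1.0, 1.0)):
    """Sum InfoNCE losses over the three granularity levels.

    video_feats / text_feats: one (batch, dim) L2-normalized array per
    level, ordered e.g. video/clip/frame vs. sentence/phrase/word.
    """
    total = 0.0
    for w, v, t in zip(weights, video_feats, text_feats):
        total += w * info_nce(v @ t.T)  # cosine similarities per level
    return total
```

With aligned pooled features the loss approaches zero, while permuting one level's text features (breaking the matched pairs) drives it up, which is the signal the multi-level training relies on.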


Similar Articles

Topic-Centered Multi-Level Representations for Text Retrieval

Motivation: The amount of information widely available in electronic form is growing at an enormous rate. It is generally accepted that this holds great promise for applications as diverse as basic research, news, entertainment, and on-line social communities. Generally useful techniques for sifting through this mostly unstructured stuff are in great demand, as can be seen by the proliferation ...


News Video Retrieval using Multi-modal Query-dependent Model and Parallel Text Corpus

This paper describes a fully automated news video retrieval system that is capable of retrieving relevant shots using a multimedia query. Our emphasis is three-fold. First, we explore the use of multi-modal features such as speaker identification, video OCR, face recognition and named entities in ASR text, along with pseudo relevance feedback, for video retrieval. Second, we employ query...


Improving Cross-Language Text Retrieval with Human Interactions

Can we expect people to be able to get information from texts in languages they cannot read? In this paper we review two relevant lines of research bearing on this question and will show how our results are being used in the design of a new Web interface for cross-language text retrieval. One line of research, “Interactive IR”, is concerned with the user interface issues for information retriev...


Cross-modal Retrieval by Text and Image Feature Biclustering

We describe our approach to the ImageCLEF-Photo 2007 task. The novelty of our method consists of biclustering image segments and annotation words. Given the query words, we may select the image segment clusters that have strongest cooccurrence with the corresponding word clusters. These image segment clusters act as the selected segments relevant to a query. We rank text hits by our own tf.idf ...


Cross-modal Embeddings for Video and Audio Retrieval

The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural net...



Journal

Journal title: IEEE Access

Year: 2022

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2022.3227973